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This paper is an invited commentary on Tamds Budavdri's presentation, "On statistical 
cross-identification in astronomy," for the Statistical Challenges in Modern Astronomy V 
conference held at Pennsylvania State University in June 2011. I begin with a brief review 
of previous work on probabilistic (Bayesian) assessment of directional and spatio-temporal 
coincidences in astronomy (e.g., cross-matching or cross-identification of objects across 
multiple catalogs). Then I discuss an open issue in the recent innovative work of Budavari 
and his colleagues on large-scale probabilistic cross-identification: how to assign prior 
probabilities that play an important role in the analysis. With a simple toy problem, I show 
how Bayesian multilevel modeling (hierarchical Bayes) provides a principled framework 
that justifies and generalizes pragmatic rules of thumb that have been successfully used by 
Budavari 's team to assign priors. 

Budavari's paper reviews the key concepts of a recent body of research by him 
and his colleagues on Bayesian cross-matching of astronomical object catalogs. 
When object directions have quantified uncertainties (e.g., error circles with con- 
fidence levels), this approach offers significant advantages over more conventional 
approaches that attempt to assess directional coincidences using ad hoc statistics 
(e.g., nearest neighbor angles, counts in cones, ^ 2 -based statistics, likelihood ratios) 
and p-values. In the mid-1990s gamma-ray burst (GRB) astronomers developed es- 
sentially the same approach for assessing evidence for repetition of GRBs [8]|4]|6l, 
and for association of GRBs with unusual supernovae [5|. This work pre-dates the 
discovery of GRB X-ray afterglows; the available GRB data provided direction es- 
timates with large uncertainties (5° to 25° error circles for directions from BATSE 
data; many-arc-minute error boxes for interplanetary network direction estimates). 
Budavari seeks to cross-match optical and UV catalogs that are much larger in size 
than GRB catalogs, and with much more accurate directions. This is a comple- 
mentary regime, raising unique challenges for Bayesian cross-matching — especially 
computational challenges — that Budavari's team is addressing with innovative tech- 
niques just briefly touched on in his paper (e.g., see J5] [Tj 13). 
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For this commentary I am taking a cue from Budavari's Discussion section, 
where he states that a "fully Bayesian solution would include a hierarchical model" 
but that "the computational requirements do not justify the extra work." Luo, Loredo 
& Wasserman (|8|, LLW96) developed a hierarchical (multilevel) Bayesian frame- 
work for assessing directional and temporal coincidences in a GRB catalog; an exact 
calculation was indeed impossible, and LLW96 had to rely on an unsatisfying ap- 
proximate treatment. However, for an important issue that Budavari discusses, a 
simple multilevel model is both computationally accessible, and illuminating. 

Specifically, Budavari highlights the important role of the prior probability for 
association, Pq, in Bayesian cross-matching; it is needed to convert marginal like- 
lihoods (or Bayes factors) for association to posterior probabilities (or odds). But 
Budavari's Po is determined using the data; it cannot really be a prior probability. 
A multilevel model not only enables estimation of Pq, but also can account for its 
uncertainty, which should play a role in assessing the method's performance in sim- 
ulation studies (e.g., determining whether the discrepancy between estimated and 
true association fractions in Budavari's Fig. 4 is acceptable). A multilevel treatment 
also illuminates other issues important for probabilistic cross-matching. In the lim- 
ited space available here I describe a simple example calculation illustrating the 
main idea; a more complete and general treatment will be published elsewhere. 

Suppose we have a "target" catalog of N t newly detected objects, and we would 
like to determine if some or all of them are associated with any of N c previously 
detected objects in a candidate host or counterpart catalog spanning the same re- 
gion of the sky. From the target observations, analysis of the data associated with 
object number i produces a likelihood function, £j((0), for its direction, co; the tar- 
get catalog provides summaries of these likeihood functions (e.g., best-fit directions 
and error circles when uncertainties can be accurately described by Fisher distribu- 
tions). Similarly, m/ ( ((o) is the likelihood function for the direction to object k in 
the candidate host catalog. Suppose the cataloged hosts are a sample from a large 
population with a known (or well-estimated) directional distribution, p c (ft)), e.g., an 
isotropic distribution with p c = 1 /An. Then the posterior distribution for the direc- 
tion to candidate host object k is p c (co)mi < (co)/Zi c , where the normalization constant 
Zfc = / dco p c (co)mic((o). The marginal likelihood that target ; is associated with host 
k (thus sharing a common direction) is 



The marginal likelihood that target i is instead from a background population of 
hosts with direction distribution Po(co) is 



The Bayes factor in favor of association of target ; with host k versus a background 
source is bn_ = hik/gi- When the direction likelihoods are proportional to Fisher 
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distributions and the host and background densities are isotropic, this corresponds 
to Budavari's B (also derived earlier by LLW96 and (0]). 

Of course, we do not know a priori which candidate host to assign to each target. 
The marginal likelihood that target i is associated with one of the candidate hosts 
must account for this uncertainty by introducing a prior probability for the host 
choice, say 1 /N c , and marginalizing over k\ the resulting marginal likelihood is 

^=^=w^f daB ^¥^^ (3) 

Budavari introduced a prior probability for association, Pq, in order to convert 
marginal likelihoods (or Bayes factors) to posterior probabilities (or odds). Using 
intuitively appealing arguments, he develops equations to determine a value for Pq, 
but they use the data, and thus Pq is not really a prior probability, and his posterior 
probabilities are not formally valid. To better motivate and extend Budavari's ap- 
pealing results, we make the association model a multilevel model, introducing a 
population parameter that we will estimate from the data. 

Define the target population association parameter, a, as the probability that a 
randomly selected target comes from the population of cataloged candidate hosts (so 
1 — a is the probability that a target comes from the background). Were a known, 
the posterior probability that target i is associated with one of the hosts would be 

But typically a will not be known a priori; in fact, estimating a may be a significant 
scientific goal. The likelihood function for a is the probability for the target data, 
given a and the host catalog information; using the above results, it is 

N t 

^(a)=n [(!-«)# + «*«]• (5) 
1=1 

A straightforward calculation shows that the maximum-likelihood value of a, a, 
satisfies the following equation: 

J>(ft) = &N t - (6) 

i 

This is an intuitively appealing result: for the maximum-likelihood value of a, the 
sum of the association probabilities is equal to the expected number of targets with 
associations. 

To see the connection with Budavari's rule for assigning Pq (his equation (10)), 
suppose the data provide direction estimates with uncertainties that are small com- 
pared with the angles between hosts. Then the sum in the marginal likelihood for 
association for a target object, equation (|3), will typically be dominated by just one 
term, so 
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where k(i) specifies the index of the host that is the nearest neighbor to target i (in the 
sense of having the largest marginal likelihood term). If we use this approximation 
for hi in equation for Pi{tt), then equation (|6]l becomes equivalent to Budavari's 
equation (10) (identifying a with his Pq, pj with his P, and &N t with his iV*), for the 
case of two catalogs. 

This calculation does more than simply justify Budavari's intuitive arguments 
for setting Pq. One concrete benefit is that it enables accounting for uncertainty in 
a. Combined with a prior for a, the likelihood function in equation <(5j produces a 
posterior for a. If the prior is not highly informative, the posterior will be asymp- 
totically normal, with a mean close to a and a variance, <7„, that can be found by 
calculating the second derivative of ln[jSf (a)] at d; the result is 



where pi = pi{&). Two limiting cases are illuminating. Suppose first that the target 
positions have very large uncertainties. In the limit where £i(co) — > C, a constant 
(i.e., uninformative data), we have g, = /?,■ = C. The Bayes factor for association of 
each object is unity (indicating the data provide no information to alter prior prob- 
abilities), and the likelihood function for a is flat, so there is no unique a value. 
The right hand side of equation (H) vanishes, implying divergence of the variance 
(actually, the asymptotic approximation is not valid with a flat likelihood function). 
The data provide no information about the association fraction in this case, as one 
would expect. Now consider the opposite limit where the direction uncertainties are 
small, leading to unambiguous associations (very large Bayes factors), so that for 
values of a away from zero or unity, p, ■ w or 1 . In this case, equation © tells 
us that a = N+/Nt, where N+ is the number of targets with pi « 1. Equation (HJ 
indicates that in this limit, a a — > 1 / \fN~t, again an intuitively reasonable result. For 
intermediate cases, where there is evidence for associations but with some ambigu- 
ity, the uncertainty in a will be larger than "root-Af," by an amount depending on the 
variance between the /3, values and a. Calculating a a for the SDSS-Galex example 
in Budavari's Section 4 may be helpful in assessing the discrepancy between the 
estimated and input values of Pq. 

In the SDSS-Galex example, Pq was over-estimated; Budavari attributes this to 
confusion due to chance proximity of objects in each catalog. But one of the aims 
of probabilistic modeling of directional coincidences is to account for this sort of 
confusion. An accurate Bayesian calculation will account for it, resulting in no 
significant bias in estimation of the association fraction, but possibly increased a 
uncertainty when the directional uncertainties lead to significant counterpart confu- 
sion. When a particular target has multiple plausible associations, the probability for 
association will be split across them. One way to see how the Bayesian calculation 
handles counterpart ambiguity is to rewrite the likelihood function to more explic- 
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itly display how it accounts for each possible association. First introduce unifying 
notation for the components in the likelihood factor for a particular target: define 
weights Wk, with wq = 1 — a and = a /N c for k = 1 to N c , and let = gj when 
k = 0. Also introduce target labels A; that take values from to N c . Then equation (0 
can be written as 



where the last sum is over all label assignments, and m^(A) is the multiplicity for 
host k, counting the number of targets with A, = k in a particular term of the sum. 
This sum-of-products decomposition displays the likelihood as a weighted sum of 
terms considering every possible assignment of targets to candidate hosts. If we 
adopt the best-candidate approximation of equation ©, only a small fraction of the 
terms is considered; when confusion is important, additional terms in hi should be 
kept so that the calculation accounts for all plausible associations. 

Equation (O also reveals an unsatisfactory aspect of the model I have described 
here: it allows for all possible host multiplicities, in particular, it allows for as- 
signing two targets to the same host. In some settings this is desirable, e.g., for 
constraining GRB repetition, or for determining whether ultra-high energy cosmic 
rays come from nearby active galaxies. But in many settings — including the SDSS- 
Galex case — it is only meaningful to assign targets to distinct hosts. This argues that 
the sum-of-products version of the likelihood function is the more fundamental rep- 
resentation to use for building coincidence assessment models; for the SDSS-Galex 
case, the sum over labels would be constrained to ensure distinct associations. This 
is why LLW96 adopted this representation for developing a general framework for 
spatio-temporal coincidence assessment. 

As a final remark on the value of an explicit multilevel model for associations, 
recall that we needed to assign a prior probability for the host choice, taken as 1 /N c 
in equation (01; in the sum-of-products version of the likelihood function, this as- 
signment appears in the w\ factors. More generally, the candidate host prior may not 
be constant; it could depend, for example, on host distances and luminosities, and 
this affects estimation of a. It is straightforward to account for this in a multilevel 
model, though it can complicate the calculations. The paper in these proceedings 
by Soiapom et al. briefly describes work by my team based on just such a model, 
developed to assess evidence for association of ultra-high energy cosmic rays with 
local active galaxies. 

Budavari developed his Bayesian approach from scratch, unaware of earlier work 
on the problem in the GRB literature. In fact, that work was well-hidden, tersely 
presented in short papers in conference proceedings. More extensive treatments did 
not follow because it proved extremely difficult to get funding to further develop 
the approach; reviewers expressed strong skepticism of Bayesian methods. To cite 
one ironically relevant example, the report from a 2005 NVO proposal review panel 
asserted that the Bayesian approach offered "nothing new" for the problem, and 
that its implementation "would not be much more than a 'few-liner' addition to 
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Xmatch," the ^ 2 -based NVO cross-match algorithm now made obsolete by Bu- 
davari's Bayesian algorithm. With this frustrating history, it has been a delight to 
see Budavari's team not only rediscover the approach, but also make significant and 
highly nontrivial statistical and computational innovations mating it to the needs of 
VAO users. 
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